AfterWork Data Science: Data Visualisation with Python Project

1. Defining the Question

a) Specifying the Data Analysis Question

Hilton International Hotels management needs help to decide on the strategies that will lead to revenue growth.

b) Defining the Metric for Success

The solutions to the following questions will help us answer our research question:

  • When is the best time of year to book a hotel room?
  • When is the optimal length of stay in order to get the best daily rate?
  • How will you know if a hotel was likely to receive a disproportionately high number of special requests?

c) Understanding the context

You have a dataset containing information on the various chain of hotels, including when customers made the booking, length of stay, the number of adults, children, or babies, and the number of available parking spaces, among other things. Given this dataset, perform data exploration, data cleaning, and analysis to come with appropriate recommendations.

d) Recording the Experimental Design

The steps to be taken include: 1. Load dataset and preview its summarized information to get a feel of what you will be working with. 2. Carry out data cleaning. 3. Carry out data analysis. 4. Interpret results. 5. Provide recommendations based on results of analysis. 6. Challenge your solution.

e) Data Relevance

For now, the data we have contains booking data which will be critical for our research specific analysis.

2. Reading the Data

Let’s first load the necessary libraries

library("tidyverse")  # to handle manipulation and visualization
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library("reshape2") #to handle manipulation
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library("countrycode") #to handle country codes
# Load the data below
# --- 
# Dataset url = 
# --- 
# YOUR CODE GOES BELOW
df_holtel <- read_csv("hotel_bookings 2.csv")
## Rows: 119390 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date  (1): reservation_status_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df_holtel)
# Checking the first 5 rows of data
# ---
# YOUR CODE GOES BELOW
# 
head(df_holtel, 5)
# Checking the last 5 rows of data
# ---
# YOUR CODE GOES BELOW
# 
tail(df_holtel, 5)
# Sample 10 rows of data
# ---
# YOUR CODE GOES BELOW
# 
sample_n(df_holtel, 10)
# Checking number of rows and columns
# ---
# YOUR CODE GOES BELOW
#  
dim(df_holtel)
## [1] 119390     32
# Checking datatypes
# ---
# YOUR CODE GOES BELOW
# 
glimpse(df_holtel)
## Rows: 119,390
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort…
## $ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
## $ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
## $ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ arrival_date_month             <chr> "July", "July", "July", "July", "July",…
## $ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
## $ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
## $ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
## $ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
## $ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat…
## $ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat…
## $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
## $ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit…
## $ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "…
## $ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
## $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type                  <chr> "Transient", "Transient", "Transient", …
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
## $ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
## $ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", …
## $ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…
str(df_holtel)
## spec_tbl_df [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:119390] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   hotel = col_character(),
##   ..   is_canceled = col_double(),
##   ..   lead_time = col_double(),
##   ..   arrival_date_year = col_double(),
##   ..   arrival_date_month = col_character(),
##   ..   arrival_date_week_number = col_double(),
##   ..   arrival_date_day_of_month = col_double(),
##   ..   stays_in_weekend_nights = col_double(),
##   ..   stays_in_week_nights = col_double(),
##   ..   adults = col_double(),
##   ..   children = col_double(),
##   ..   babies = col_double(),
##   ..   meal = col_character(),
##   ..   country = col_character(),
##   ..   market_segment = col_character(),
##   ..   distribution_channel = col_character(),
##   ..   is_repeated_guest = col_double(),
##   ..   previous_cancellations = col_double(),
##   ..   previous_bookings_not_canceled = col_double(),
##   ..   reserved_room_type = col_character(),
##   ..   assigned_room_type = col_character(),
##   ..   booking_changes = col_double(),
##   ..   deposit_type = col_character(),
##   ..   agent = col_character(),
##   ..   company = col_character(),
##   ..   days_in_waiting_list = col_double(),
##   ..   customer_type = col_character(),
##   ..   adr = col_double(),
##   ..   required_car_parking_spaces = col_double(),
##   ..   total_of_special_requests = col_double(),
##   ..   reservation_status = col_character(),
##   ..   reservation_status_date = col_date(format = "")
##   .. )
##  - attr(*, "problems")=<externalptr>

Record your general observations below: The columns have specific datatypes Some columns contain NULL as a value. This requires cleaning.

3. External Data Source Validation

The data is originally from the article Hotel Booking Demand Datasets, by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.

4. Data Preparation

Performing Data Cleaning

# Checking datatypes and missing entries of all the variables
# ---
# YOUR CODE GOES BELOW
# 

str(df_holtel)
## spec_tbl_df [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:119390] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   hotel = col_character(),
##   ..   is_canceled = col_double(),
##   ..   lead_time = col_double(),
##   ..   arrival_date_year = col_double(),
##   ..   arrival_date_month = col_character(),
##   ..   arrival_date_week_number = col_double(),
##   ..   arrival_date_day_of_month = col_double(),
##   ..   stays_in_weekend_nights = col_double(),
##   ..   stays_in_week_nights = col_double(),
##   ..   adults = col_double(),
##   ..   children = col_double(),
##   ..   babies = col_double(),
##   ..   meal = col_character(),
##   ..   country = col_character(),
##   ..   market_segment = col_character(),
##   ..   distribution_channel = col_character(),
##   ..   is_repeated_guest = col_double(),
##   ..   previous_cancellations = col_double(),
##   ..   previous_bookings_not_canceled = col_double(),
##   ..   reserved_room_type = col_character(),
##   ..   assigned_room_type = col_character(),
##   ..   booking_changes = col_double(),
##   ..   deposit_type = col_character(),
##   ..   agent = col_character(),
##   ..   company = col_character(),
##   ..   days_in_waiting_list = col_double(),
##   ..   customer_type = col_character(),
##   ..   adr = col_double(),
##   ..   required_car_parking_spaces = col_double(),
##   ..   total_of_special_requests = col_double(),
##   ..   reservation_status = col_character(),
##   ..   reservation_status_date = col_date(format = "")
##   .. )
##  - attr(*, "problems")=<externalptr>
# Let's remove the date column to allow NULL replacement
df_holtel <- df_holtel[,!(names(df_holtel) %in% c('reservation_status_date'))]

# we notice company have so many NULL as values we replace them with Na
df_holtel[df_holtel == 'NULL'] <- NA


colSums(is.na(df_holtel))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                            488 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                          16340 
##                        company           days_in_waiting_list 
##                         112593                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status 
##                              0

We observe the following from our dataset: Company, country and agent column contains “NULL” as data entries which indicates missing values, converting company NULL to NA shows 112593 missing values Children column have 4 Na values.

# Checking how many duplicate rows are there in the data
# ---
# YOUR CODE GOES BELOW
# 
df_holtel[duplicated(df_holtel), ]
#dim(df_holtel[duplicated(df_holtel), ])

We choose to keep the duplicates because we don’t have a unique identifier to actually proof that we have duplicates.

# Checking if any of the columns are all null
# ---
# YOUR CODE GOES BELOW
# 
colSums(is.na(df_holtel))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                            488 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                          16340 
##                        company           days_in_waiting_list 
##                         112593                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status 
##                              0

We observe the following from our dataset: Company has 112593 missing values Country has 488 missing values agent has 16340 missing values

# Checking if any of the rows are all null
# ---
# YOUR CODE GOES BELOW
# 
# df with Null values
df_null_holtel <- df_holtel[rowSums(is.na(df_holtel)) > 0,]
dim(df_null_holtel)
## [1] 119173     31

We observe the following from our dataset: There are 119173 rows with missing values.

# Checking the correlation of the features through the use of 
# visualizations the correlation using heatmap
# ---
# YOUR CODE GOES BELOW
# 

# Create a correlation matrix
corr_matrix <- cor(df_holtel[, unlist(lapply(df_holtel, is.numeric))])
head(corr_matrix)
##                            is_canceled   lead_time arrival_date_year
## is_canceled                1.000000000 0.293123356      0.0166598602
## lead_time                  0.293123356 1.000000000      0.0401420998
## arrival_date_year          0.016659860 0.040142100      1.0000000000
## arrival_date_week_number   0.008148065 0.126870813     -0.5405613308
## arrival_date_day_of_month -0.006130079 0.002267553     -0.0002210192
## stays_in_weekend_nights   -0.001791078 0.085671133      0.0214973654
##                           arrival_date_week_number arrival_date_day_of_month
## is_canceled                            0.008148065             -0.0061300789
## lead_time                              0.126870813              0.0022675527
## arrival_date_year                     -0.540561331             -0.0002210192
## arrival_date_week_number               1.000000000              0.0668092530
## arrival_date_day_of_month              0.066809253              1.0000000000
## stays_in_weekend_nights                0.018207653             -0.0163542995
##                           stays_in_weekend_nights stays_in_week_nights
## is_canceled                          -0.001791078           0.02476463
## lead_time                             0.085671133           0.16579936
## arrival_date_year                     0.021497365           0.03088330
## arrival_date_week_number              0.018207653           0.01555830
## arrival_date_day_of_month            -0.016354300          -0.02817352
## stays_in_weekend_nights               1.000000000           0.49896882
##                                 adults children        babies is_repeated_guest
## is_canceled                0.060017213       NA -0.0324910892      -0.084793418
## lead_time                  0.119518693       NA -0.0209150163      -0.124409908
## arrival_date_year          0.029635144       NA -0.0131920747       0.010341317
## arrival_date_week_number   0.025909057       NA  0.0103954801      -0.030130758
## arrival_date_day_of_month -0.001565979       NA -0.0002303647      -0.006145021
## stays_in_weekend_nights    0.091871020       NA  0.0184828105      -0.087239379
##                           previous_cancellations previous_bookings_not_canceled
## is_canceled                           0.11013281                  -0.0573577232
## lead_time                             0.08604180                  -0.0735481679
## arrival_date_year                    -0.11982207                   0.0292180512
## arrival_date_week_number              0.03550091                  -0.0209035517
## arrival_date_day_of_month            -0.02701078                  -0.0002997868
## stays_in_weekend_nights              -0.01277462                  -0.0427152350
##                           booking_changes days_in_waiting_list         adr
## is_canceled                 -0.1443809911           0.05418582  0.04755660
## lead_time                    0.0001488301           0.17008418 -0.06307685
## arrival_date_year            0.0308723496          -0.05649660  0.19758009
## arrival_date_week_number     0.0055075035           0.02293303  0.07579064
## arrival_date_day_of_month    0.0106128560           0.02272754  0.03024519
## stays_in_weekend_nights      0.0632813159          -0.05415111  0.04934191
##                           required_car_parking_spaces total_of_special_requests
## is_canceled                              -0.195497817              -0.234657774
## lead_time                                -0.116450570              -0.095712049
## arrival_date_year                        -0.013684411               0.108531486
## arrival_date_week_number                  0.001920423               0.026149364
## arrival_date_day_of_month                 0.008683466               0.003062124
## stays_in_weekend_nights                  -0.018553809               0.072670830
# Create a table with long form
corr_df <- melt(corr_matrix)
head(corr_df)
# Plot the heatmap
ggplot(corr_df, aes(Var1, Var2, fill = value)) + 
  geom_tile(color = "black") + 
  geom_text(
    aes(label = round(value, 2)), 
    color = "white"
  ) +
  coord_fixed() + 
  labs(
    fill="Pearson Correlation"
  ) +
  scale_fill_gradient2(
    low = "blue", 
    high = "red",
    mid = "white", 
    limit = c(-1,1)
  ) + 
  theme(
    axis.title.x = element_text(angle = 61, vjust = 0.5, hjust=1),
    axis.title.y = element_blank()
  )
## Warning: Removed 34 rows containing missing values (geom_text).

We observe the following from our dataset:

Increase in lead time increases the chances of a booking cancellation Repeat guest have a lower chance of cancelling a booking. Guests who had previously canceled have a higher likelihood of cancelling again.

# Dropping company column because it has alot of missing values 
# and we won't need to answer any of our questions
# ---
# YOUR CODE GOES BELOW
# 

df_holtel <- df_holtel[,!(names(df_holtel) %in% c('company'))]
names(df_holtel)
##  [1] "hotel"                          "is_canceled"                   
##  [3] "lead_time"                      "arrival_date_year"             
##  [5] "arrival_date_month"             "arrival_date_week_number"      
##  [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
##  [9] "stays_in_week_nights"           "adults"                        
## [11] "children"                       "babies"                        
## [13] "meal"                           "country"                       
## [15] "market_segment"                 "distribution_channel"          
## [17] "is_repeated_guest"              "previous_cancellations"        
## [19] "previous_bookings_not_canceled" "reserved_room_type"            
## [21] "assigned_room_type"             "booking_changes"               
## [23] "deposit_type"                   "agent"                         
## [25] "days_in_waiting_list"           "customer_type"                 
## [27] "adr"                            "required_car_parking_spaces"   
## [29] "total_of_special_requests"      "reservation_status"

From the data variable description we see that the Distribution Channel categoricy that tells us about Booking distribution.

The term “TA” means “Travel Agents” The term “TO” means “Tour Operators” This allows us to fill the missing values in the agents column with TO

# We replace the mising values i.e. for TO
# ---
# YOUR GOES BELOW
#
sum(is.na(df_holtel$distribution_channel))
## [1] 0
unique(df_holtel$distribution_channel)
## [1] "Direct"    "Corporate" "TA/TO"     "Undefined" "GDS"

Distribution Channel doesn’t have any missing value

# We drop rows where there is no adult, baby and child as 
# these records won't help us.
# ---
# YOUR GOES BELOW
#
sum(is.na(df_holtel$adults))
## [1] 0

This analysis conflicts with below one, since we have children with missing value only we will skip this

# We replace missing children values with rounded mean value
# ---
# Hint i.e. use round()
# ---
# YOUR GOES BELOW
#
df_holtel$children[is.na(df_holtel$children)] <- round(mean(df_holtel$children, na.rm = TRUE))
sum(is.na(df_holtel$children))
## [1] 0
# Checking for missing values in the dataframe
# ---
# YOUR GOES BELOW
#
#droppin na rows
df_holtel <- na.omit(df_holtel)
colSums(is.na(df_holtel))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              0                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##           days_in_waiting_list                  customer_type 
##                              0                              0 
##                            adr    required_car_parking_spaces 
##                              0                              0 
##      total_of_special_requests             reservation_status 
##                              0                              0

No missing values

5. Solution Implementation

5.a) Questions

# 1. How many bookings were cancelled?
# ---
# Visualisation: Barplot
ggplot(data = df_holtel, aes(x = is_canceled)) +
  geom_bar(stat = "count", fill = "steelblue") +
  geom_text(stat = "count", aes(label = ..count..), hjust = 1) +
  labs(title = "Bookings cancelled",
       x = "Cancenclled State",
       y = "No. of bookings") +
  theme_classic() + scale_color_brewer(palette = "Set2")

# 2. What was the booking ratio between resort hotel and city hotel?
# ---
# Barplot of booking ratio between resort hotel and city hotel

# Visualize the cancellation by hotek type
ggplot(data = df_holtel,
       aes(
         x = hotel,
         y = prop.table(stat(count)),
         label = scales::percent(prop.table(stat(count)))
       )) +
  geom_bar(position = position_dodge(), fill = "steelblue") +
  geom_text(
    stat = "count",
    position = position_dodge(.9),
    vjust = -0.5,
    size = 3
  ) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Booking ratio between resort hotel and city hotel",
       x = "Hotel Type",
       y = "Count(%)") +
  theme_classic() 

# 3. What was the percentage of booking for each year?
# ---
# 

ggplot(data = df_holtel,
       aes(
         x = arrival_date_year,
         y = prop.table(stat(count)),
         label = scales::percent(prop.table(stat(count)))
       )) +
  geom_bar(position = position_dodge(), fill = "steelblue") +
  geom_text(
    stat = "count",
    position = position_dodge(.9),
    vjust = -0.5,
    size = 3
  ) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Percentage of booking for each year",
       x = "Year",
       y = "Count(%)") +
  theme_bw()

# 4. Which were the most busiest months for hotels?
# ---
# 
# Organize the Month in proper order
df_holtel$arrival_date_month <-
  factor(df_holtel$arrival_date_month, levels = month.name)
# Visualize Hotel traffic on Monthly basis
ggplot(data = df_holtel, aes(x = arrival_date_month)) +
  geom_bar(fill = "steelblue") +
  geom_text(stat = "count", aes(label = ..count..), hjust = 1) +
  coord_flip() + labs(title = "Month Wise Booking Request",
                      x = "Month",
                      y = "Count") +
  theme_classic()

# 5. From which top 3 countries did most guests come from?
# --- 
# YOUR GOES BELOW
#

# where are the people coming from
df_holtel_1 <- df_holtel[df_holtel$reservation_status == "Check-Out",]
# Subset the data to include the countries which has more than 1500 reservation request
# otherwise including all the country with few or occassional request to avoid the graph
# from being clumsy
sub_hotel <- df_holtel_1 %>% 
  group_by(country) %>% 
  filter(n() > 1500)

# Visualize the Travellor by Country.
sub_hotel$county_name <- countrycode(sub_hotel$country, 
                                     origin = "iso3c",
                                     destination = "country.name")

# Traveller by Country per hotel wise
ggplot(sub_hotel, aes(county_name, fill = hotel)) + 
  geom_bar(stat = "count", position = position_dodge()) + 
  labs(title = "Booking Status by Country",
       x = "Country",
       y = "Count") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.background = element_blank())

# 6.a) How long do most people stay in hotels?
#   b) By city and resort? Separate the data by hotel
# ---
# 

# Total Stay Duration
ggplot(sub_hotel, aes(stays_in_weekend_nights + stays_in_week_nights)) + 
  geom_density(col = "red") +facet_wrap(~hotel) + theme_bw()

# 7. Which was the most booked accommodation type (Single, Couple, Family)?
# ---
#
ggplot(sub_hotel, aes(customer_type, fill = hotel)) + 
  geom_bar(stat = "count", position = position_dodge()) + 
  labs(title = "Hotel Preference by Customer Type",
       x = "Customer Type",
       y = "Count") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.background = element_blank())

5.b) Recommendations

From the above analysis, below are our recommendations:

  1. Focus on City hotel, given that most customers prefer this type of the hotel. A further research is required to address why resort have a lower preference.
  2. Focus more on Transient customer type, given they are most popular type of customers.
  3. Focus on investing more in Portugal, citizens there seems to have a tendency of using hotels compared to others.
  4. Booking seems to have a seasonal behaviour, consider having discounts duing mid year May-Sept.
  5. Cancellation is significantly high, indicating dissatisfied customers with either the experience or charges compared to competition. To establish why the cancellation rate is so high more research is required. But improving customer experience is terms of reducing the lead time and offering flexible affordable charges.

6. Challenging your Solution

In this step, we review our solution and implement approaches that could potentially provide a better outcome. In our case, we could propose the following question that wasn’t answered in our solution because it couldn’t have greatly contributed to our recommendation.

# When should hotels provide special offers?
# ---
# YOUR GOES BELOW
# 
#one approach is check cancellation rate by month

ggplot(df_holtel, aes(arrival_date_month, fill = factor(is_canceled))) +
  geom_bar() + geom_text(stat = "count", aes(label = ..count..), hjust = 1) +
  coord_flip() + scale_fill_discrete(
    name = "Booking Status",
    breaks = c("0", "1"),
    label = c("Cancelled", "Not Cancelled")
  ) +
  labs(title = "Booking Status by Month",
       x = "Month",
       y = "Count") + theme_bw()

Our observations: - The best time to introduce offers is between May and September. This will reduce cancellation and increase revenue.

How does this observation tie to our solution?

7. Follow up questions

During this step, you rethink and propose other ways that you can improve your solution.

How does the charge of the hotel affect revenue? How does the environment in-terms of scenery affect revenue?

a). Did we have the right data? Yes, for the initial analysis done. b). Do we need other data to answer our question? Yes, competition data to compare charges and customers reviews to analyse sentiments and how such affect revenue. c). Did we have the right question? Yes.